Abstract —Indoor scene recognition is a multi-faceted and 
challenging problem due to the diverse intra-class variations and 
the confusing inter-class similarities. This paper presents a novel 
approach which exploits rich mid-level convolutional features 
to categorize indoor scenes. Traditionally used convolutional 
features preserve the global spatial structure, which is a desirable 
property for general object recognition. However, we argue 
that this structure is of little help when there are 
large variations in scene layouts, e.g., in indoor scenes. We 
propose to transform the structured convolutional activations to 
another highly discriminative feature space. The representation 
in the transformed space not only incorporates the discriminative 
aspects of the target dataset, but it also encodes the features in 
terms of the general object categories that are present in indoor 
scenes. To this end, we introduce a new large-scale dataset of 1300 
object categories which are commonly present in indoor scenes. 
Our proposed approach achieves a significant performance boost 
over previous state of the art approaches on five major scene 
classification datasets. 

Index Terms —Scene classification, convolutional neural networks, 
indoor objects dataset, feature representations, dictionary 
learning, sparse coding 


I. Introduction 

This paper proposes a novel method which captures the 
discriminative aspects of an indoor scene to correctly predict 
its semantic category (e.g., bedroom, kitchen, etc.). This 
categorization can greatly assist in context-aware object and 
action recognition, object localization, robotic navigation and 
manipulation [48], [49]. However, owing to the large variability 
between images of the same class and the confusing 
similarities between images of different classes, the automatic 
categorization of indoor scenes is a very challenging problem 
p4| , 14^ . Consider, for example, the images shown in Fig. 1. 
The images in the top row (Fig. 1a) belong to the same class 
‘bookstore’ and exhibit a large data variability in the form of 
object occlusions, cluttered regions, pose changes and varying 
appearances. The images in the bottom row (Fig. 1b) are of 
three different classes and have large visual similarities. A 
high-performance classification system should therefore be able to 
cope with the inherently challenging nature of indoor scenes. 

To deal with the challenges of indoor scenes, previous 
works 1^, p^ , p4| , [28], p4] | propose to encode either 
local or global spatial and appearance information. In this 
paper, we argue that neither of those two representations 
provides the best answer to effectively handle indoor scenes. 
The global representations are unable to model the subtle 
details, and the low-level local representations cannot capture 
object-to-object relations and the global structures p0| , p4| , 
[40]. We therefore devise mid-level representations that carry 
the necessary intermediate level of detail. These mid-level 
representations neither ignore the local cues nor lose the 
important scene structure and object category relationships. 

(a) All three are very different-looking “bookstore” images. 
How can we take into account the high variability across 
indoor scenes of each scene type? 

(b) Images of a “Library”, a “Museum” and a “Church” 
(left to right): How can we hope to learn the subtle 
differences between different scene types? 

Fig. 1: ‘Where am I located indoors?’ We want to answer 
this question by assigning a semantic class label to a given 
color image. An indoor scene categorization framework must 
take into account high intra-class variability and should be able 
to tackle confusing inter-class similarities. This paper introduces 
a methodology to meet these challenging requirements 
(example images from the MIT-67 dataset). 

Our proposed mid-level representations are derived from 
densely and uniformly extracted image patches. In order to 
extract a rich feature representation from these patches, we use 
deep Convolutional Neural Networks (CNNs). CNNs provide 
excellent generic mid-level feature representations and have 
recently shown great promise for large-scale classification and 
detection tasks pQ| . They, however, tend to 
preserve the global spatial structure of the images, which 
is not desirable when there are large intra-class variations, e.g., 
in the case of indoor scene categorization (Fig. 1). We therefore 
propose a method to discount this global spatial structure, 
while simultaneously retaining the intermediate scene structure 
which is necessary to model the mid-level scene elements. For 
this purpose, we encode the extracted mid-level representations 
in terms of their association with codebooks¹ of Scene 
Representative Patches (SRPs). This enhances the robustness of 
the convolutional feature representations, while keeping intact 
their discriminative power. 

It is interesting to note that some previous works hint 
towards the incorporation of ‘wide context’ |T§, 1^, 1^ 
for scene categorization. Such high-level context-aware reasoning 
has been shown to improve the classification performance. 
However, in this work, we show that for the case of 
highly variant indoor scenes, mid-level context relationships 
prove to be the most decisive factor in classification. The 
intermediate level of scene detail helps in learning the 
subtle differences in the scene composition and its constituent 
objects. In contrast, global structure patterns can confuse the 
learning/classification algorithm due to the high inter-class 
similarities (Sec. IV-C). 

As opposed to existing feature encoding schemes, we propose 
to form multiple codebooks of SRPs. We demonstrate 
that forming multiple smaller codebooks (instead of one large 
codebook) is more efficient and yields better 
performance (Sec. IV-D). Another key aspect of our feature 
encoding approach is the combination of supervised and 
unsupervised SRPs in our codebooks. The unsupervised SRPs 
are collected from the training data itself, while the supervised 
SRPs are extracted from a newly introduced dataset of 
‘Object Categories in Indoor Scenes’ (OCIS). The supervised 
SRPs provide semantically meaningful information, while the 
unsupervised SRPs relate more to the discriminative aspects 
of the different scenes that are present in the target dataset. 
The efficacy of the proposed approach is demonstrated through 
extensive experiments on five challenging scene classification 
datasets. Our experimental results show that the proposed 
approach consistently achieves state of the art performance. 

The major contributions of this paper are: 1) We propose 
a new mid-level feature representation for indoor scene categorization 
using large-scale deep neural nets (Sec. III). 2) Our 
feature description incorporates not only the discriminative 
patches of the target dataset but also the general object 
categories that are semantically meaningful (Sec. III-C). 3) 
We collect the first large-scale dataset of object categories 
that are commonly present in indoor scenes. This dataset 
contains more than 1300 indoor object classes (Sec. IV-A). 4) 
To improve the efficiency and performance of our approach, 
we propose to generate multiple smaller codebooks and a 
feasible feature encoding (Sec. III-C). 5) We introduce a 
novel method to encode feature associations using max-margin 
hyper-planes (Sec. III-D). 


II. Related Work 

Based upon the level of image description, existing scene 
classification techniques can be categorized into three types: 
1) those which capture low-level appearance cues, 2) those 
which capture the high-level spatial structure of the scene and 
3) those which capture mid-level relationships. The techniques 
which capture low-level appearance cues j^, fT^ perform 
poorly on the majority of indoor scene types since they fail to 
incorporate the high-level spatial information. The techniques 
which model the human-perceptible global spatial envelope 
also fail to cope with the high variability of indoor scenes. 
The main reason for the low performance of these approaches 
is their neglect of the fine-grained objects, which are important 
for the task of scene classification. 

¹A codebook is a collection of distinctive mid-level patches. 

Considering the need to extract global features as well as the 
characteristics of the constituent objects, Quattoni et al. p4| 
and Pandey et al. represented a scene as a combination 
of root nodes (which capture the global characteristics of the 
scene) and a set of regions of interest (which capture the 
local object characteristics). However, the manual or automatic 
identification of these regions of interest makes their approach 
indirect and thus complicates the scene classification task. 
Another example of an indirect approach to scene recognition is 
the one proposed by Gupta et al. Q, where the grouping, 
segmentation and labeling outcomes are combined to recognize 
scenes. Learned mid-level patches are employed for 
scene categorization by Juneja et al. |T3| , Doersch et al. 0 
and Sun et al. ED- . However, these works involve considerable 
effort in learning the distinctive primitives, which includes 
discriminative patch ranking and selection. In contrast, our 
mid-level representation does not require any learning. Instead, 
we uniformly extract the mid-level patches densely from the 
images and show that these perform best when combined with 
supervised object representations. 

Deep Convolutional Neural Networks have recently shown 
great promise in large-scale visual recognition and classification 
0, 1^, | [3T| , p6| , [53]. Although CNN features have 
demonstrated their discriminative power for images with one 
or multiple instances of the same object, they do preserve 
the spatial structure of the image, which is not desirable 
when dealing with the variability of indoor scenes |T0| . CNN 
architectures involve max-pooling operations to deal with the 
local spatial variability in the form of rotation and translation 
eg However, these operations are not sufficient to cope 
with the large-scale deformations of objects and parts that are 
commonly present in indoor scenes p0| , In this work, 
we propose a novel representation which is robust to variations 
in the spatial structure of indoor scenes. It represents an image 
in terms of the association of its mid-level patches with the 
codebooks of the SRPs. 


III. The Proposed Method 

The block diagram of our proposed pipeline, called ‘Deep 
Un-structured Convolutional Activations (DUCA)’, is shown 
in Fig. 2. Our proposed method first densely and uniformly 
extracts mid-level patches (Sec. III-A), represents them by 
their convolutional activations (Sec. III-B) and then encodes 
them in terms of their association with the codebooks of 
SRPs (Sec. III-D), which are generated in supervised and 
unsupervised manners (Sec. III-C). The detailed description of 
each component of the proposed pipeline is presented next. 


A. Dense Patch Extraction 

To deal with the high variability of indoor scenes, we 
propose to extract mid-level instead of global or local 
HD feature representations. Mid-level representations do 
not ignore object-level relationships and the discriminative 
appearance-based local cues (unlike the high-level global 
descriptors), and do not ignore the holistic shape and scene 
structure information (unlike the low-level local descriptors). 
For each image, we extract dense mid-level patches using a 
sliding window of 224 × 224 pixels with a fixed step size of 
32 pixels. In order to extract a reasonable number of patches, the 
smaller dimension of the image is re-scaled to an appropriate 
length (700 pixels in our case). Note that the idea of dense 
patch extraction is analogous to dense key-point extraction 
p7| , which has shown very promising performance over well-designed 
key-point extraction methods in a number of tasks 
(e.g., action recognition [46]). 

Fig. 2: Deep Un-structured Convolutional Activations: Given an input image, we extract from it dense mid-level patches, 
represent the extracted patches by their convolutional activations and encode them in terms of their association with the 
codebooks of Scene Representative Patches (SRPs). The designed codebooks have both supervised and unsupervised SRPs. 
The resulting associations are then pooled and the class belonging decisions are predicted using a linear classifier. 
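As a rough illustration of the dense extraction step in this subsection, the sketch below computes where the 224 × 224 windows would land once the smaller image side has been rescaled to 700 pixels. The function names `rescale_size` and `patch_origins`, the rounding of the rescaled size, and the decision to drop any partial window at the right and bottom borders are assumptions made for illustration; the text does not specify them.

```python
def rescale_size(height, width, target_short_side=700):
    """Return the (height, width) after scaling the shorter side to target_short_side."""
    short_side = min(height, width)
    return (round(height * target_short_side / short_side),
            round(width * target_short_side / short_side))


def patch_origins(height, width, window=224, step=32):
    """Top-left (row, col) corners of all full windows on a regular grid."""
    rows = range(0, height - window + 1, step)
    cols = range(0, width - window + 1, step)
    return [(r, c) for r in rows for c in cols]


# Toy example: a 480 x 640 input image (hypothetical size).
new_h, new_w = rescale_size(480, 640)
origins = patch_origins(new_h, new_w)
print(new_h, new_w, len(origins))
```

Under these assumptions a 480 × 640 image is rescaled to 700 × 933 and yields a 15 × 23 grid, i.e., 345 patches per image before augmentation.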

Before the dense patch extraction, we augment the images 
of the dataset with their flipped, cropped and rotated versions 
to enhance the generalization of our feature representation. 
First, five cropped images (four from the corners and one 
from the center) of | size are extracted from the original 
image. Each original image is also subjected to CW and CCW 
rotations of | radians and the resulting images are included 
in the augmented set. The horizontally flipped versions of all 
these eight images (1 original + 5 cropped + 2 rotated) are 
also included. The proposed data augmentation results in a 
reasonable performance boost (see Sec. IV-D). 


B. Convolutional Feature Representations 

We need to map the raw image patches to a discriminative 
feature space where scene categories are easily separable. 
For this purpose, instead of using shallow or local feature 
representations, we use the convolutional activations from 
a trained deep CNN architecture. Learned representations 
based on CNNs have significantly outperformed hand-crafted 
representations in nearly all major computer vision tasks 
(TT). Our CNN architecture is similar to ‘AlexNet’ [15] 
(trained on ILSVRC 2012) and consists of 5 convolutional 
and 3 fully-connected layers. The main difference compared 
to AlexNet is the dense connections between each pair of 
consecutive layers in the 8-layered network (in our case). The 
densely and uniformly extracted patches from the images are 
fed to the network’s input layer after mean normalization. 
The processed output from the network is taken from an 
intermediate fully connected layer (7th layer). The resulting 
feature representation of each mid-level patch has a dimension 
of 4096. 

Although CNN activations capture rich discriminative information, 
they are inherently highly structured. The main 
reason is the sequence of operations involved in the hierarchical 
layers of a CNN, which preserve the global spatial structure 
of the image. This constraining structure is a limitation 
when dealing with highly variable indoor scene images. To 
address this, we propose to encode our patches (represented 
by their convolutional activations) in an alternate feature space 
which turns out to be even more discriminative (Sec. III-D). 
Specifically, an image is encoded in terms of the association 
of its extracted patches with the codebooks of the Scene 
Representative Patches (SRPs). 


C. Scene Representative Patches (SRPs) 

An indoor scene is a collection of several distinct objects 
and concepts. We are interested in extracting a set of image 
patches of these objects and concepts, which we call ‘Scene 
Representative Patches’ (SRPs). The SRPs can then be used 
as elements of a codebook to characterize any instance of 
an indoor scene. Examples of these patches for a bedroom 
scene include a bed, wardrobe, sofa or a table. Designing a 
comprehensive codebook of these patches is a very challenging 
task. There can be two possible solutions: first, automatically 
learn to discover a number of discriminative patches from 
the training data and second, manually prepare an exhaustive 
vocabulary of all objects which can be present in indoor 
scenes. These solutions are quite demanding: first, because of 
the possibility of a very large number of objects, and second, 
because they may require automatic object detection, localization or 
distinctive patch selection, which in itself is very challenging 
and computationally expensive. 

In this work, we propose a novel approach to compile a 
comprehensive set of SRPs. Our proposed approach avoids 
the drawbacks of the above-mentioned strategies and successfully 
combines their strengths, i.e., it is computationally very 
efficient while being highly discriminative and semantically 
meaningful. Our set of SRPs has two main components, 
compiled in a supervised and an unsupervised manner. These 
components are described next. 

1) Supervised SRPs: A codebook of supervised SRPs is 
generated from images of well-known object categories expected 
to be present in a particular indoor scene (e.g., a 
microwave in a kitchen, a chair in a classroom). The codebook 
contains human-understandable elements which carry well-defined 
semantic meanings (similar to attributes 0 or object 
banks pQ|). In this regard, we introduce the first large-scale 
database of object categories in indoor scenes (Sec. 
IV-A). The introduced database includes an extensive set of 
indoor objects (more than 1300). The codebook of supervised 
SRPs is generated from images of the database by extracting 
dense mid-level patches after re-sizing the smallest dimension 
of each image to 256 pixels. The number of SRPs in the 
compiled codebook is equal to the number of object categories in the 
OCIS database. For this purpose, in the feature space, each 
SRP is a max-pooled version of the convolutional activations 
(Sec. III-B) of all the mid-level patches extracted from that 
object category. The supervised codebook is then used in 
Sec. III-D to characterize a given scene image in terms of 
its constituent objects. 
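The construction of a supervised SRP can be sketched as an element-wise maximum over the activations of all patches that belong to one object category. The snippet below is only a toy illustration under that reading: the helper names and the 3-dimensional vectors are hypothetical stand-ins for the 4096-dimensional activations described in Sec. III-B.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*vectors)]


def build_supervised_codebook(activations_by_category):
    """Map each object category to one SRP: the max-pooled activation of its patches."""
    return {category: max_pool(activations)
            for category, activations in activations_by_category.items()}


# Toy activations (3-d instead of 4096-d) for two hypothetical categories.
toy_activations = {
    "bed": [[0.1, 0.9, 0.0], [0.4, 0.2, 0.3]],
    "chair": [[0.5, 0.5, 0.5]],
}
codebook = build_supervised_codebook(toy_activations)
print(codebook)
```

With this construction the codebook has exactly one entry per object category, which matches the statement that the number of supervised SRPs equals the number of categories in the OCIS database.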

2) Unsupervised SRPs: The codebook of unsupervised 
SRPs is generated from the patches extracted from the training 
data. First, we densely and uniformly extract patches from 
training images by following the procedure described in Sec. 
III-A. The SRPs can then be generated from these patches 
using any unsupervised clustering technique. However, in our 
case, we randomly sample the patches as our unsupervised 
SRPs. This is because we are dealing with a very large 
number of extracted patches and unsupervised clustering 
can be computationally prohibitive. We demonstrate in our 
experiments (Sec. IV-C, IV-D) that random sampling does not 
cause any noticeable performance degradation, while achieving 
significant computational advantages. 

Ideally, the codebook of SRPs should be all-inclusive and 
cover all discriminative aspects of indoor scenes. One might 
therefore expect a large number of SRPs in order to cover 
all the possible aspects of various scene categories. While 
this is indeed the case, feature encoding from a single large 
codebook would be computationally burdensome (Sec. IV-D). 
We therefore propose to generate multiple codebooks of relatively 
smaller sizes. The association vectors from each of 
these codebooks can then be concatenated to generate a high 
dimensional feature vector. This guarantees the incorporation 
of a large number of SRPs at a low computational cost. To 
this end, we generate three unsupervised codebooks, each with 
3000 SRPs. The codebook size was selected empirically on a 
small validation set. 
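A minimal sketch of the random-sampling step is given below, assuming that each codebook is drawn independently and without replacement within a codebook. Whether the three codebooks are allowed to share patches is not stated in the text, so that choice, the function name `sample_codebooks` and the seed are assumptions.

```python
import random


def sample_codebooks(patch_features, num_codebooks=3, codebook_size=3000, seed=0):
    """Draw several codebooks of randomly chosen patch features."""
    rng = random.Random(seed)
    # Each codebook is sampled without replacement from the full pool;
    # different codebooks are drawn independently and may overlap.
    return [rng.sample(patch_features, codebook_size) for _ in range(num_codebooks)]


# Toy pool: integers stand in for the 4096-d activations of training patches.
pool = list(range(10000))
codebooks = sample_codebooks(pool)
unsupervised_length = sum(len(cb) for cb in codebooks)
print(len(codebooks), unsupervised_length)
```

Concatenating one association value per SRP then gives 3 × 3000 = 9000 entries from the unsupervised codebooks, to which the entries from the supervised codebook are appended.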

The SRPs in the supervised codebook are semantically 
meaningful; however, they do not include all possible aspects 
of the different scene categories. The unsupervised codebook 
compensates for this shortcoming and complements the supervised 
codebook. The combination of supervised and 
unsupervised codebooks therefore results in improved 
discrimination and accuracy (see Sec. IV-D). 


D. Feature Encoding from SRPs 


Given an RGB image $I \in \mathbb{R}^{H \times W \times 3}$, our task is to find 
its feature representation in terms of the previously generated 
codebooks of SRPs (Sec. III-C1 and III-C2). For this purpose, 
we first densely extract patches $\{p^{(i)} \in \mathbb{R}^{224 \times 224 \times 3}\}$ from 
the image using the procedure explained in Sec. III-A. Next, 
the patches are represented by their convolutional activations 
as discussed in Sec. III-B. The patches are then encoded in 
terms of their association with the SRPs of the codebooks. 
The following two strategies are devised for this purpose. 

1) Sparse Linear Coding: Let $X \in \mathbb{R}^{4096 \times m}$ be a codebook 
of $m$ SRPs. A mid-level patch $p^{(i)}$ is sparsely reconstructed 
from the SRPs of the codebook using: 

$$\min_{f^{(i)}} \; \| X f^{(i)} - p^{(i)} \|_2 + \lambda \| f^{(i)} \|_1 , \qquad (1)$$

where $\lambda$ is the regularization constant. The sparse coefficient vector 
$f^{(i)}$ is then used as the final feature representation of the patch. 
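To make the sparse coding step more concrete, the sketch below solves an $\ell_1$-regularized reconstruction with the iterative shrinkage-thresholding algorithm (ISTA). This is a swapped-in illustrative solver: the text does not say which solver the authors used, and the code minimizes the common squared form $\frac{1}{2}\|Xf - p\|_2^2 + \lambda\|f\|_1$, which is an assumption. The function names and the tiny 2-dimensional codebook are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (proximal operator of the l1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_code(atoms, patch, lam=0.1, iters=500):
    """ISTA for 0.5 * ||X f - p||^2 + lam * ||f||_1, with atoms = columns of X."""
    m, d = len(atoms), len(patch)
    # 1 / (squared Frobenius norm of X) never exceeds 1 / L, where L is the
    # largest eigenvalue of X^T X, so this fixed step size is safe.
    step = 1.0 / sum(v * v for atom in atoms for v in atom)
    f = [0.0] * m
    for _ in range(iters):
        recon = [sum(f[j] * atoms[j][k] for j in range(m)) for k in range(d)]
        resid = [recon[k] - patch[k] for k in range(d)]
        grad = [sum(atoms[j][k] * resid[k] for k in range(d)) for j in range(m)]
        f = [soft_threshold(f[j] - step * grad[j], step * lam) for j in range(m)]
    return f


# Toy example: two orthonormal atoms in 2-d, patch mostly aligned with the first.
toy_atoms = [[1.0, 0.0], [0.0, 1.0]]
toy_code = sparse_code(toy_atoms, [1.0, 0.05], lam=0.1)
print(toy_code)
```

For orthonormal atoms the minimizer is the soft-thresholded projection of the patch, so the weakly matching second atom receives a coefficient of exactly zero, which is the sparsity effect the $\ell_1$ penalty is meant to produce.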

2) Proposed Classifier Similarity Metric Coding: We propose 
a new soft encoding method which uses the maximum 
margin hyper-planes to measure feature associations. Given 
a codebook of $m$ SRPs, we train $m$ linear binary one-vs-all 
SVMs. An SVM finds the maximum margin hyperplane 
which optimally discriminates an SRP from all others. Let 
$W \in \mathbb{R}^{4096 \times m}$ be the learnt weight matrix of all learnt SVMs. 
A patch $p^{(i)}$ can then be encoded in terms of the trained SVMs 
using $f^{(i)} = W^T p^{(i)}$. Since we have multiple codebooks ($K$ 
in total), the patch is separately encoded from all of them. 
The final representation of $p^{(i)}$ is then achieved by concatenating 
the encoded feature representations from all codebooks 
into a single feature vector $[f_1^{(i)}, \ldots, f_K^{(i)}]$. 
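The classifier-based encoding can be sketched as follows, assuming that the association of a patch with an SRP is the signed response of that SRP's linear SVM and that no bias term is used. Both points are assumptions, as are the function names and the toy weights; training the one-vs-all SVMs themselves is outside the scope of this sketch.

```python
def svm_responses(weight_vectors, activation):
    """Signed response w^T x of every SVM; weight_vectors holds the m columns of W."""
    return [sum(w_d * x_d for w_d, x_d in zip(w, activation))
            for w in weight_vectors]


def encode_patch(codebook_weights, activation):
    """Concatenate the responses obtained from each of the K codebooks."""
    encoded = []
    for weight_vectors in codebook_weights:
        encoded.extend(svm_responses(weight_vectors, activation))
    return encoded


# Toy example: 3-d activations, K = 2 codebooks with 2 and 1 SRPs respectively.
weights_a = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
weights_b = [[0.5, 0.5, 0.5]]
code = encode_patch([weights_a, weights_b], [2.0, 1.0, 3.0])
print(code)
```

The length of the encoded vector equals the total number of SRPs over all codebooks, so adding a codebook only appends entries and leaves the existing ones unchanged.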


E. Classification 

The encoded feature representations from all mid-level 
patches of the image are finally pooled to produce the overall 
feature representation of the image. Two commonly used pooling 
strategies (mean pooling and max pooling) are explored in 
our experiments (see Sec. IV-D). Finally, in order to perform 
classification, we use one-vs-one linear SVMs: 

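$$\min_{\mathbf{w}} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_i \max\left(0, 1 - y^{(i)} \mathbf{w}^T \mathbf{x}^{(i)}\right) , \qquad (2)$$

where $\mathbf{w}$ is the normal vector to the learned max-margin 
hyper-plane, $C$ is the regularization parameter and $y^{(i)}$ is the 
binary class label of the feature vector $\mathbf{x}^{(i)}$. 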
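The pooling step mentioned at the start of this subsection reduces the set of encoded patch vectors of one image to a single image-level vector. A minimal sketch of the two strategies is shown below; the function names and the toy 3-dimensional codes are hypothetical.

```python
def mean_pool(codes):
    """Average each dimension over all encoded patches of an image."""
    count = len(codes)
    return [sum(column) / count for column in zip(*codes)]


def max_pool(codes):
    """Keep the strongest association per dimension over all encoded patches."""
    return [max(column) for column in zip(*codes)]


# Toy example: three encoded patches with three association values each.
patch_codes = [
    [0.2, -1.0, 0.5],
    [0.6, 0.0, 0.1],
    [0.1, 0.4, 0.3],
]
image_mean = mean_pool(patch_codes)
image_max = max_pool(patch_codes)
print(image_mean, image_max)
```

Both pooled vectors have the same length as a single patch code, so the image-level feature fed to the linear SVMs does not depend on how many patches an image produced.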
Fig. 3: CMC curve for the benchmark evaluation on the OCIS dataset. 
The curve illustrates the challenging nature of the dataset. 


IV. Experiments and Evaluation 


We evaluate our approach on three indoor scene classification 
datasets. These include the MIT-67 dataset, the 15 Category 
Scene dataset and the NYU indoor scene dataset. Confusing 
inter-class similarities and high within-class variabilities make 
these datasets very challenging. Specifically, MIT-67 is the 
largest dataset of indoor scene images containing 67 classes. 
The images of many of these classes are very similar looking 
e.g., inside-subway and inside-bus (see Fig for example 
confusing and challenging images). Moreover, we also report 
results on two event and object classification datasets (Graz- 
02 dataset and 8-Sports event dataset) to demonstrate that 
the proposed technique is applicable to other related tasks. 
A detailed description of each of these datasets, followed 
by our experimental setups and the corresponding results 
are presented in Sec. IV-B and IV-C. First, we provide a 
description of our introduced OCIS dataset below. 


A. A Dataset of Object Categories in Indoor Scenes 

A very large number of scene elements (including 
objects, structures and materials) can be present in indoor 
scenes. Any information about these scene elements can prove 
crucial for the scene categorization task (and even beyond, 
e.g., for semantic labeling or attribute identification). 
However, to the best of our knowledge, there is no publicly 
available dataset of these indoor scene elements. In this paper, 
we introduce the first large-scale OCIS (Object Categories 
in Indoor Scenes) database. The database contains a total of 
15324 images spanning more than 1300 frequently occurring 
indoor object categories. The number of images in each category 
is about 11. The database can potentially be used for fine-grained 
scene categorization, high-level scene understanding 
and attribute-based reasoning. In order to collect the data, 
a comprehensive list of 1325 indoor objects was manually 
chosen from the labelings provided with the MIT-67 p4| 
dataset. This taxonomy includes a diverse set of object classes 
ranging from a 'house' to a 'handkerchief'. A word cloud of 
the top 300 most frequently occurring classes is shown in Fig. 4. 
The images for each class are then collected using an online 
image search (Google API). Each image contains one or more 
instances of a specific object category. In order to illustrate the 
diverse intra-class variability of this database, we show some 
example images in Fig. 5. Our in-house annotated database 
will be made freely available to the research community. 

Fig. 4: A word cloud of the top 300 most frequently occurring 
classes in our introduced Object Categories in Indoor Scenes 
(OCIS) database. 

For the benchmark evaluation, we represent the images of 
the database by their convolutional features and feed them 
to a linear classifier (SVM). A train-test split of 66%-33% 
is defined for each class. The classification results in terms 
of the Cumulative Match Curve (CMC) are shown in Fig. 3. 
The rank-1 and rank-20 identification rates turn out to be only 
32% and 67% respectively. These modest classification rates 
suggest that indoor object categorization is a very challenging 
task. 
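For readers unfamiliar with the Cumulative Match Curve, the sketch below shows one common way to compute rank-k identification rates from per-class classifier scores: a test sample counts as a hit at rank k if its true class is among the k highest-scoring classes. The function name `cmc_curve` and the toy scores are hypothetical, and the exact evaluation code used for Fig. 3 is not described in the text.

```python
def cmc_curve(score_rows, true_labels, max_rank):
    """Fraction of samples whose true class is within the top-k scores, for k = 1..max_rank."""
    hits = [0] * max_rank
    for scores, truth in zip(score_rows, true_labels):
        # Class indices sorted from the highest to the lowest score.
        ranking = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        position = ranking.index(truth)  # 0-based rank of the true class
        for k in range(position, max_rank):
            hits[k] += 1
    total = len(true_labels)
    return [h / total for h in hits]


# Toy example: three test samples scored against three classes.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
]
toy_labels = [0, 2, 0]
curve = cmc_curve(toy_scores, toy_labels, max_rank=3)
print(curve)
```

The curve is non-decreasing by construction and reaches 1.0 once the rank equals the number of classes, so the reported rank-1 and rank-20 rates are simply two points read off this curve.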


B. Evaluated Datasets 

The performance of our proposed method is evaluated on 
MIT-67 dataset, 15 Category Scene dataset, NYU Indoor 
Scene dataset, Graz-02 dataset and 8-Sports event dataset. 
Below, we present a brief description of each of these datasets 
followed by an analysis on the achieved performance. 

1) MIT-67 Dataset: It contains 15620 images of 67 indoor 
categories. For performance evaluation and comparison, we 
followed the standard evaluation protocol in p4| , in which a 
subset of the data is used (100 images per class) and the train-test 
split is defined to be 80%-20% for each class. 

2) 15 Category Scene Dataset: It contains images of 15 
urban and natural scene categories. The number of images in 
each category ranges from 200-400. For our experiments, we 
use the same evaluation setup as in p^ , where 100 images 
per class are used for training and the rest for testing. 

3) NYU v1 Indoor Scene Dataset: It consists of 7 indoor 
scene categories with a total of 2347 images. Following the 
standard experimental protocol, we used a 60%-40% 
train/test split for evaluation. Care was taken while 
splitting the data to ensure that minimal or no overlap of 
consecutive frames exists between the training and testing 
sets. 

4) Inria Graz-02 Dataset: It consists of 1096 images belonging 
to 3 classes (bikes, cars and people) in the presence of 
heavy clutter, occlusions and pose variations. For performance 
evaluation, we used the protocol defined in [25]. Specifically, 
for each class, the first 150 odd images are used for training 
and the 150 even images are used for testing. 

5) UIUC 8-Sports Event Dataset: It contains 1574 images 
of 8 sports categories. Following the protocol defined in |T9| , 
we used 70 randomly sampled images for training and 60 for 
testing. 
Fig. 5: Example images from the ‘Object Categories in Indoor Scenes’ dataset. This dataset contains a diverse set of object 
classes with different sizes and scales (e.g., Alcove and Melon). Each category includes a rich set of images with differences 
in appearance, shape, viewpoint and background. 



MIT-67 Indoor Scenes Dataset 

Method                                  Accuracy (%) 
ROI + GIST [CVPR’09]                    26.1 
MM-Scene [NIPS’10]                      28.3 
SPM [CVPR’06]                           34.4 
Object Bank [NIPS’10]                   37.6 
RBoW [CVPR’12]                          37.9 
Weakly Supervised DPM [ICCV’11]         43.1 
SPMSM [ECCV’12]                         44.0 
LPR-LIN [ECCV’12]                       44.8 
BoP [CVPR’13]                           46.1 
Hybrid Parts + GIST + SP [ECCV’12]      47.2 
OTC [ECCV’14]                           47.3 
Discriminative Patches [ECCV’12] [40]   49.4 
ISPR [CVPR’14]                          50.1 
D-Parts [ICCV’13]                       51.4 
VC + VQ [CVPR’13]                       52.3 
IFV [CVPR’13]                           60.8 
MLRep [NIPS’13]                         64.0 
CNN-MOP [ECCV’14]                       68.9 
CNNaug-SVM [CVPRw’14]                   69.0 
Proposed DUCA                           71.8 


TABLE I: Mean accuracy on the MIT-67 Indoor Scenes Dataset. Comparisons with the previous state-of-the-art methods are 
also shown. Our approach performs best in comparison to techniques which use a single or multiple feature representations. 



15 Category Scene Dataset 

Method                                  Accuracy (%) 
GIST-color [IJCV’01]                    69.5 
RBoW [CVPR’12]                          78.6 
Classemes [ECCV’10]                     80.6 
Object Bank [NIPS’10]                   80.9 
SPM [CVPR’06]                           81.4 
SPMSM [ECCV’12]                         82.3 
LCSR [CVPR’12]                          82.7 
SP-pLSA [PAMI’08]                       83.7 
CENTRIST [PAMI’11]                      83.9 
HIK [ICCV’09]                           84.1 
OTC [ECCV’14]                           84.4 
ISPR [CVPR’14]                          85.1 
VC + VQ [CVPR’13]                       85.4 
LMLE [CVPR’10]                          85.6 
LPR-RBE [ECCV’12]                       85.8 
Hybrid Parts + GIST [ECCV’12]           86.3 
CENTRIST+LCC+Boosting [CVPR’11]         87.8 
RSP [ECCV’12]                           88.1 
IFV                                     89.2 
LScSPM [CVPR’10]                        89.7 
Proposed DUCA                           94.5 

TABLE II: Mean accuracy on the 15 Category Scene Dataset. Comparisons with the previous best techniques are also shown. 


C. Experimental Results 


The quantitative results of the proposed method for the 
task of indoor scene categorization are presented in Tables 
I, II and IV. The proposed method achieves the highest 
classification rate on all three datasets. Compared with the 
existing state of the art, a relative performance increment of 
4.1%, 5.4% and 1.3% is achieved for MIT-67, Scene-15 and 
NYU datasets respectively. Amongst the compared methods, 
the mid-level feature representation based methods 
ED perform better than the others. Our proposed mid-level 
feature based method not only outperforms them in accuracy 
but is also computationally efficient (e.g., 0 takes weeks to 
train several part detectors). Furthermore, when compared with 
existing methods, our proposed method uses a lower-dimensional 
feature representation for classification (e.g., the Juneja 
et al. ED Improved Fisher Vector (IFV) has dimensionality 
> 200K; the Gong et al. 0 MOP representation has > 12K). 

In addition to indoor scene classification, we also evaluate 
our approach on other scene classification tasks where large 
variations and deformations are present. To this end, we report 
the classification results on the UIUC 8-Sports dataset and the 
Graz-02 dataset (see Tables [nl| and \V \ . It is interesting to note 
that the Graz-02 dataset contains heavy clutter, pose and scale 
variations (e.g., for some ‘car’ images only 5% of the pixels 
are covered by the car in a scene). Our approach achieved high 
accuracies of 98.7% and 98.6% respectively on the UIUC 8- 
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UIUC 8-Sports Dataset 

Method 

Accuracy (%) 


GIST-color [IJCV’Ol] @ 70.7 

MM-Scene [NIPS’10] 71.7 

Graphical Model [ICC’Hv] 73.4 

Object Bank [NIPS’10] 1^ 76.3 

Object Attributes [ECC’Sni] 1^ 77.9 

CENTRIST [PAMI’ll] 1^ 78.2 

RSP [ECCV’12] 79.6 

SPM [CVPR’06] 81.8 

SPMSM [ECCV’ITt ITtI 83.0 

Classemes [ECCV’lOn^ 84.2 

HIK [ICCV’09] 84.2 

LScSPM [CVPRTB] 85.3 

LPR-RBE [ECCV’12]^T^ 86.2 

Hybrid Parts + GIST [ECCV’12] 87.2 

LCSR [CVPR’12] 87.2 

VC + VQ [CVPR’ferl^ 88.4 

lEV 90.8 

ISPrVa^PR’14] 89.5 


Proposed DUCA 


98.7 


TABLE III: Mean accuracy on the UIUC 8-Sports Dataset. 


NYU Indoor Scenes Dataset 

Method 


Accuracy (%) 

BoW-SIFT [ICCVw’ll 

1 

55.2 

RGB-LLC [TC’13] M 

3] 

78.1 

RGB-LLC-RPSL [TCH 

79.5 

Proposed DUCA 


80.6 



[Fig. 6 panel labels: Children room, Kindergarten | Cloister, Corridor | Movie theatre, Auditorium | Subway, TV studio | Bookstore, Library | Mall, Train station]

Fig. 6: Example mistakes and limitations of our method.
Most of the incorrect predictions are due to ambiguous cases.
The actual and predicted class names are shown in 'blue' and
'red' respectively. (Best viewed in color)


classes with significant visual and semantic similarities are
confused with each other, e.g., childrenroom-kindergarten
and movietheatre-auditorium (Fig. 6).
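
The observation that errors concentrate on a few closely related class pairs can be checked directly from the predictions. The following is a minimal sketch, assuming only that the ground-truth and predicted labels are available as two parallel lists; the function names are hypothetical and this is not the evaluation code used for the reported results. It computes the mean per-class accuracy and ranks unordered class pairs by how often they are confused.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def mean_class_accuracy(actual, predicted):
    """Average of the per-class accuracies (each class weighted equally)."""
    counts = confusion_counts(actual, predicted)
    per_class_total = Counter(actual)
    accuracies = [counts[(c, c)] / n for c, n in per_class_total.items()]
    return sum(accuracies) / len(accuracies)


def most_confused_pairs(actual, predicted, top=3):
    """Rank unordered class pairs by their total number of mutual errors."""
    pair_errors = Counter()
    for (a, p), n in confusion_counts(actual, predicted).items():
        if a != p:
            # Treat library->bookstore and bookstore->library as one pair.
            pair_errors[tuple(sorted((a, p)))] += n
    return pair_errors.most_common(top)


if __name__ == "__main__":
    actual = ["library", "library", "bookstore", "bookstore", "mall", "mall"]
    predicted = ["bookstore", "library", "library", "bookstore", "mall", "mall"]
    print(mean_class_accuracy(actual, predicted))
    print(most_confused_pairs(actual, predicted, top=1))
```

Merging both error directions into one unordered pair matches the way confusions are named in the text (e.g., library-bookstore); keeping them separate would instead reproduce the off-diagonal cells of the confusion matrix.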

In order to visualize which patches contributed most towards 
a correct classification, we plot the heat map of the patch 
contribution scores in Fig. 8. It turns out that the most
distinctive patches, which carry valuable information, have a 
higher contribution towards the correct prediction of a scene 
class. Moreover, mid-level patches carry an intermediate level 
of scene details and contextual relationships between objects 
which help in the scene classification process. 
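
A heat map of this kind can be built by spreading each patch's score over the pixels it covers. The sketch below is illustrative only: it assumes square patches given as (top, left, size, score) tuples with non-negative scores, averages the scores where patches overlap, and rescales the result to [0, 1]. How the contribution scores themselves are obtained is not covered here, and the function name `patch_heat_map` is hypothetical.

```python
def patch_heat_map(height, width, patches):
    """Build a normalized heat map from per-patch contribution scores.

    patches: iterable of (top, left, size, score) tuples describing
    square patches; scores are assumed to be non-negative.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]

    for top, left, size, score in patches:
        # Clip the patch to the image bounds before accumulating.
        for r in range(max(top, 0), min(top + size, height)):
            for c in range(max(left, 0), min(left + size, width)):
                total[r][c] += score
                count[r][c] += 1

    # Average the scores of overlapping patches; uncovered pixels stay at 0.
    heat = [
        [total[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(width)]
        for r in range(height)
    ]

    # Rescale so that the strongest pixel has value 1.
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[value / peak for value in row] for row in heat]
    return heat


if __name__ == "__main__":
    example = patch_heat_map(3, 3, [(0, 0, 2, 1.0), (1, 1, 2, 3.0)])
    for row in example:
        print(row)
```

Averaging rather than summing keeps densely sampled regions from appearing important merely because more patches overlap there.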


TABLE IV: Mean Accuracy for the NYU v1 Dataset.


Sports and Graz-02 datasets. These performances are 10.2% 
and 12.6% higher than the previous best methods on UIUC 
8-Sports and Graz-02 datasets respectively. 

The class-wise classification accuracies of the MIT-67,
UIUC 8-Sports, Scene-15 and NYU datasets are shown in the
form of confusion matrices (see Fig. 7 for the Scene-15, UIUC
8-Sports and NYU datasets). Note the very strong diagonal in
all confusion matrices. The majority (> 90%) of the mistakes
are made for closely related classes, e.g., coast-opencountry
(Fig. 7a), croquet-bocce (Fig. 7b), bedroom-livingroom (Fig. 7c),
and dentaloffice-operatingroom and library-bookstore (MIT-67).
We also show examples of misclassified images in Fig. 6. The
results show that the


Graz-02 Dataset

Method               Cars    People   Bikes   Overall
OLB [SCIA'05]        70.7    81.0     76.5    76.1
VQ [ICCV'07]         80.2    85.2     89.5    85.0
ERC-F [PAMI'08]      79.9    -        84.4    82.1
TSD-IB [BMVC'11]     87.5    85.3     91.2    88.0
TSD-k [BMVC'11]      84.8    87.3     90.7    87.6
Proposed DUCA        98.7    98.0     99.0    98.6

TABLE V: Equal Error Rates (EER) for the Graz-02 dataset.
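
The equal error rate operating point is the decision threshold at which the false positive rate equals the false negative rate. The sketch below shows one simple way to locate it from classifier scores. It is an illustration under stated assumptions, not the evaluation code behind Table V: the function name `equal_error_rate` and the sweep over observed scores are hypothetical choices. Since the text describes the values in Table V as accuracies, they presumably correspond to 100 minus the error at this point; that reading is also an assumption.

```python
def equal_error_rate(pos_scores, neg_scores):
    """Return (eer, threshold) at the point where FPR and FNR are closest.

    pos_scores: classifier scores of the positive (target class) images.
    neg_scores: classifier scores of the negative images.
    A sample is predicted positive when its score is >= threshold.
    """
    best = None
    for threshold in sorted(set(pos_scores) | set(neg_scores)):
        # Positives scored below the threshold are missed detections.
        fnr = sum(s < threshold for s in pos_scores) / len(pos_scores)
        # Negatives scored at or above the threshold are false alarms.
        fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
        gap = abs(fnr - fpr)
        if best is None or gap < best[0]:
            best = (gap, (fnr + fpr) / 2.0, threshold)
    return best[1], best[2]


if __name__ == "__main__":
    eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
    print("EER:", eer, "threshold:", threshold)
    print("accuracy at EER (%):", 100.0 * (1.0 - eer))
```

With finitely many scores the two error rates may never coincide exactly, so the sketch reports their average at the threshold where they are closest.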


D. Ablative Analysis 

To analyze the effect of the different components of the 
proposed scheme on the final performance, we conduct an 
ablation study. Table VI summarizes the results achieved on
the MIT-67 Scenes dataset when different components were 
replaced or removed from the final framework. 

It turns out that the supervised and unsupervised codebooks
individually perform reasonably well. However, their combination
gives state-of-the-art performance. For the unsupervised
codebook, k-means clustering performs slightly better,
but at the cost of a considerable amount of computational
resources (~40GB RAM for the MIT-67 dataset)
and processing time (~1 day for the MIT-67 dataset). In
contrast, the random sampling of MRPs gives a comparable
performance with a big boost in computational efficiency.
Feature encoding from a single large codebook not only
yields lower performance, but also requires more
computational time and memory. In our experiments, feature
encoding from one single large codebook requires almost
twice the time (~45 sec/image) taken by multiple smaller
codebooks (~25 sec/image). The resulting features performed
best when the max-pooling operation was applied to combine
them.
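
The comparison between pooling operators and between codebook layouts can be illustrated with a small sketch. Several details below are assumptions made for illustration only: a plain dot product stands in for the sparse linear coding compared in Table VI, pooling is taken over the patches of one image, and the pooled vectors of the individual codebooks are concatenated. The function names are hypothetical.

```python
def encode(patch, codebook):
    """Code a patch feature against one codebook.

    Stand-in encoder: dot product with each codebook atom. This is NOT
    sparse linear coding; it only produces one value per atom.
    """
    return [sum(p * a for p, a in zip(patch, atom)) for atom in codebook]


def pool(codes, mode="max"):
    """Combine per-patch codes into one vector, dimension by dimension."""
    columns = list(zip(*codes))
    if mode == "max":
        return [max(column) for column in columns]
    if mode == "mean":
        return [sum(column) / len(column) for column in columns]
    raise ValueError("mode must be 'max' or 'mean'")


def image_descriptor(patches, codebooks, mode="max"):
    """Encode all patches with each codebook, pool, then concatenate."""
    descriptor = []
    for codebook in codebooks:
        codes = [encode(patch, codebook) for patch in patches]
        descriptor.extend(pool(codes, mode))
    return descriptor


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 2.0]]
    codebooks = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
    print(image_descriptor(patches, codebooks, mode="max"))
    print(image_descriptor(patches, codebooks, mode="mean"))
```

Max-pooling keeps the strongest response to each codebook atom regardless of where in the image it occurs, which fits the goal of discarding the global spatial structure; mean-pooling dilutes a few strong responses when most patches are uninformative.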


V. Conclusion 

This paper proposed a robust feature representation based on 
discriminative mid-level convolutional activations, for highly

(a) 15 Categories Scenes Dataset    (b) UIUC 8-Sports Dataset    (c) NYU Indoor Scenes Dataset

Fig. 7: Confusion matrices for three scene classification datasets. (Best viewed in color)

Fig. 8: The contribution of distinctive patches to the correct class prediction of a scene is shown in the form of a heat map
('red' means more contribution). These examples show that our approach captures the discriminative properties of distinctive
mid-level patches and uses them to predict the correct class. (Best viewed in color)


Variants of Our Approach                 Accuracy (%)

Supervised codebook                      68.5
Unsupervised codebook                    69.9
Supervised + Unsupervised                71.8

K-means clustering                       72.0
Random sampling                          71.8

Single large codebook                    71.4
Multiple smaller codebooks               71.8

Sparse linear coding                     71.8
Classifier similarity metric coding      69.9

Mean-pooling                             69.7
Max-pooling                              71.8

Original data                            69.1
Data augmentation                        71.8

TABLE VI: Ablative Analysis on MIT-67 Scene Dataset.


variable indoor scenes. To suitably adapt the convolutional
activations for indoor scenes, the paper proposed to break
their inherently preserved global spatial structure by encoding
them in terms of multiple codebooks. These codebooks
are composed of distinctive patches and of semantically
labeled elements. For the labeled elements, we introduced the
first large-scale dataset of object categories of indoor scenes.
Our approach achieves state-of-the-art performance on five
very challenging datasets.